Freeze frame

28 June 2010

Tier0, CERN Computing Centre


Back in May we reported a new approach being introduced at Tier0: freezing the physics content of the software to ensure that data could be fairly compared from one run to the next and coherent datasets could be built up for the summer conferences. Two months in, the pros and cons of the new strategy are becoming clear…

In a nutshell, the point of freezing the software, for both real and simulated data, is to allow new data to be continuously tacked on to the end of the existing dataset. The big ‘pro’ of this approach is that new data will be immediately comparable with that which has already been reprocessed.

“From my point of view, it also means that I have fewer requests for [software] updates, because some of them are so obviously incompatible,” explains David Côté, who is just coming to the end of an 18-month stint as Prompt Reconstruction Operations Coordinator (PROC). “On the other hand, each update request is more work because we need to understand what it's doing in much more detail.”

This is because changes must be validated one by one, and where developers could previously submit change requests themselves, the mechanics of this now fall solely to David and his co-PROC, Walter Lampl. Change requests are still filed because, although the physics content of the software is fixed, says David, “We try to push the concept of being frozen to the limit.”

“When people ask for an update, it's always for a good reason,” he argues, “and psychologically it's difficult to block improvements.” Total freezing is both unnecessary – because some things can be changed without affecting the reconstruction output – and impossible – because other things must be kept up-to-date for data-taking. Detector monitoring (mapping dead channels, and observing how the tracking and combined reconstruction are working) is a good example of the latter.

Elsewhere, updating for improved detector alignment is another change that has been deemed acceptable. “That's like a software change in some sense, in that the reconstruction results will be different from one day to the next,” David explains. “But the agreement with Monte Carlo should improve as a result of this, and not decrease. So we accept it.”

“It's sort of an art that we have to learn: how to plug in as many updates as possible and still not change the physics!” he smiles.

Up until April, the primary concern was whether or not altered code still functioned. Now, consistency in the reconstruction output is suddenly a factor: “In the mornings, we have to check not only that it runs, but that the results are identical [to those produced by previous versions of the code, on a test sample],” David explains.
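The article does not describe the checks themselves, but the idea can be illustrated with a minimal sketch: assume each validation job dumps a per-event summary of its reconstruction output to a text file, and compare the dump from the candidate update against the one from the reference release on the same test sample. The file format and the script and function names below are hypothetical, purely for illustration; the real Tier0 validation machinery is more involved.

    # Illustrative sketch only: compare two per-event output dumps line by line.
    # Assumes each job writes one summary line per event to a plain text file.
    from itertools import zip_longest
    import sys


    def compare_outputs(reference_path: str, candidate_path: str) -> bool:
        """Return True if the candidate output matches the reference, event by event."""
        with open(reference_path) as ref, open(candidate_path) as cand:
            for event_number, (ref_line, cand_line) in enumerate(zip_longest(ref, cand), start=1):
                if ref_line != cand_line:
                    # zip_longest pads the shorter file with None, so a length
                    # mismatch also shows up here as a difference.
                    print(f"First difference at event {event_number}:")
                    print(f"  reference: {ref_line!r}")
                    print(f"  candidate: {cand_line!r}")
                    return False
        return True


    if __name__ == "__main__":
        matched = compare_outputs(sys.argv[1], sys.argv[2])
        sys.exit(0 if matched else 1)

Running the same release twice on the same sample and diffing the two dumps with the same script would also serve as a crude probe for the kind of irreproducibility described next.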

Producing identical results – or failing to – is one area in which the Tier0 freeze has proved to be a bit of a headache, though. A couple of ‘irreproducibility bugs’ cause the code to spit out different results each time it is run on the same events, even when no changes are applied in the interim.

One such bug was spotted two weeks before the Tier0 was frozen for the May reprocessing, requiring everyone to go back and scour their code to see if it was the source. But in the absence of an obvious smoking gun, the mystery wasn't solved in time to be fixed before the campaign began in earnest, ahead of ICHEP.

Unfortunately, the culprit was deep in the TRT code, which made it all the harder to spot as it had lain dormant for years. It was only when ATLAS started considering particles with much lower transverse momentum (a key difference in scope between the April and May reprocessing campaigns) that an incorrect assumption in the TRT code – that all particles would necessarily leave the sub-detector and go on into the calorimeter beyond – manifested itself as a spanner in the works.

Before May, all the particles under consideration had had enough momentum to fulfil this criterion, but with the broader spectrum of momenta studied in May, ‘looper’ particles without enough transverse momentum to escape the TRT were wrongly accounted for in the code. Under the frozen Tier0, even though this problem has now been traced, it can't be fixed until the whole software is updated at the next full reprocessing.
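A rough back-of-the-envelope sketch (not the actual TRT code, whose internals the article does not describe) shows why the lower-momentum tracks exposed the assumption: in the roughly 2 T solenoid field of the inner detector, a charged particle of transverse momentum pT (in GeV) bends with a radius of about pT / (0.3 × B) metres, and a track starting near the beam line only reaches a cylinder of radius R if twice that bending radius exceeds R. With a TRT outer radius of roughly a metre, anything much below about 0.3 GeV curls up inside the tracker instead of exiting. The values and function below are approximate and for illustration only.

    # Back-of-the-envelope illustration only, not ATLAS software.
    # Approximate values: ~2 T solenoid field, TRT outer radius ~1.08 m.
    SOLENOID_FIELD_T = 2.0
    TRT_OUTER_RADIUS_M = 1.08


    def escapes_trt(pt_gev: float) -> bool:
        """True if a track of this transverse momentum can reach the TRT outer radius."""
        bend_radius_m = pt_gev / (0.3 * SOLENOID_FIELD_T)   # helix radius in metres
        return 2.0 * bend_radius_m >= TRT_OUTER_RADIUS_M    # a helix from the beam line reaches at most its diameter


    if __name__ == "__main__":
        for pt in (0.1, 0.3, 0.5, 1.0, 5.0):
            outcome = "exits towards the calorimeter" if escapes_trt(pt) else "loops inside the tracker"
            print(f"pT = {pt:.1f} GeV: {outcome}")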

“It's tough for bug-fixers, if you have a known bug and you cannot fix it,” David sympathises. So far though, this is the only un-patchable bug of any consequence that has been found, and the effect on ICHEP data should be negligible: “We have made quite some effort to assess it, and concluded that it's not really a problem, except for specialised things like tracking commissioning itself.”

The Tier0 has been frozen for about five weeks this time around, but how long it will stay that way is an open question. “We're discussing it right now, and we have been every other week for a long time,” smiles David.

Bugs and glitches are noted on the DataMCforAnalysis TWiki as they are uncovered, and until there is a fatal problem in the data or there is significant benefit in moving to a new software update, things will stay as they are. Ultimately, like a game of Buckaroo, when enough niggling problems have mounted up, the time will come for action.

“The problems are generally minor and recoverable. But if you add up a long list, it becomes more and more difficult to manage.” In the short term though, says David, “The goal is to remain frozen until [the] ICHEP [data cut-off date], to provide the maximum possible dataset.”

--------------------------------------------------------------------------------------------------------------------------------
Collaborators are invited to attend the new ATLAS Reconstruction Meeting on Tuesdays at 16:00, which aims to bring together the physics community and reconstruction experts to discuss the timeframe of new releases. Slides from the last meeting are available here.

 

Ceri Perkins

ATLAS e-News